This notebook details my approach to building predictive models for newly released games on boardgamegeek.com (BGG). Specifically, I am interested in taking any newly released boardgame and, using features that are available at the time of its release, estimating how it will be received on BGG: its average rating, number of user ratings, and complexity rating.
While the goal of this project is ultimately to yield accurate predictions for upcoming games, we are also interested in understanding what the models learn. What features of games are associated with high/low average rating? Why do some games receive high numbers of user ratings? What types of games are the most complex?
To answer these questions, we’ll make use of historical data from boardgamegeek. We will connect to a database on GCP containing a variety of tables on game features and their current ratings on BGG. In training models, we will restrict ourselves to games published through 2018. We will then validate our models by evaluating their predictions for games published in 2019.
The data we are using comes from boardgamegeek.com, which we access via the open BGG API. We are training models on data that was last pulled from BGG on 2022-03-18.
We will be training models at the game-level, where every row corresponds to one game and every column corresponds to a feature of the game.
As of our most recent pull, our dataset contains 97845 games. This is the entirety of games on BGG, many of which are unpublished prototypes and have not received any ratings by the BGG community.
If we filter to games with a minimum of 30 user ratings, we have only 22347 games.
For the bulk of this analysis, we will be training on games that have achieved at least 30 user ratings. This is a design decision that 1) restricts our sample to games that have received some evaluation from the community and 2) speeds up model training. We can later treat this cutoff as a tuning parameter, allowing more or fewer historical games into the training set. Based on some initial tests, 30 was a useful cutoff point for both model performance and training time.
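As a minimal sketch of this filter (the column names here, like `users_rated`, are hypothetical stand-ins for the actual BGG table):

```python
import pandas as pd

# Toy stand-in for the game-level BGG table (hypothetical columns).
games = pd.DataFrame({
    "game_id": [1, 2, 3, 4],
    "users_rated": [5, 30, 120, 0],
    "year_published": [2017, 2018, 2019, 2021],
})

MIN_RATINGS = 30  # cutoff chosen from initial tests; tunable later

# Keep only games with at least MIN_RATINGS user ratings.
rated = games[games["users_rated"] >= MIN_RATINGS]
print(len(rated))  # 2
```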
We will set up a training and validation split based on time. First, we’ll train models on games published through 2018, then evaluate their performance in predicting games published in 2019 and 2020. We will then make our model selection and retrain the models on all games published through 2020 in order to predict upcoming games.
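The time-based split above can be sketched as follows (toy data; `year_published` is an assumed column name):

```python
import pandas as pd

games = pd.DataFrame({
    "game_id": range(1, 7),
    "year_published": [2016, 2017, 2018, 2019, 2020, 2021],
})

# Train on games published through 2018, validate on 2019-2020,
# and hold out games after 2020 as the "upcoming" prediction set.
train = games[games["year_published"] <= 2018]
valid = games[games["year_published"].between(2019, 2020)]
upcoming = games[games["year_published"] > 2020]
```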
We will be modeling four different outcomes: average weight, average rating, user ratings, and the geek rating. The geek rating is itself a combination of the average rating and number of user ratings, but I will be interested to see how well we do in modeling it directly vs modeling the underlying components and computing it.
Our model training and evaluation plan will look something like this:
We are interested in modeling a number of different outcomes: a game’s average rating, complexity rating, and number of user ratings.
These outcomes aren’t independent, as complexity and the average rating are highly correlated.
As we will see, this means that if we want to predict a game’s average rating, the most important feature is usually its average weight. But because a game’s average rating and complexity are both voted on by the BGG community, we won’t know either at the time of its release. This means that for newly upcoming games, we will first use a model to estimate a game’s complexity and then use that estimate as an input into our average rating model.
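A minimal sketch of this two-stage idea on synthetic data, using ordinary least squares (the actual models are more sophisticated; this only illustrates the predicted complexity feeding the rating model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic release-time features (e.g. playing time, player count).
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
complexity = X @ np.array([2.5, 0.5, -0.2]) + rng.normal(scale=0.1, size=200)
rating = 5.0 + 0.8 * complexity + rng.normal(scale=0.1, size=200)

# Stage 1: fit a complexity (average weight) model on release-time features.
w_weight, *_ = np.linalg.lstsq(X, complexity, rcond=None)
complexity_hat = X @ w_weight

# Stage 2: the rating model takes the *predicted* complexity as a feature,
# because the true community-voted weight is unknown at release.
X_rating = np.column_stack([X, complexity_hat])
w_rating, *_ = np.linalg.lstsq(X_rating, rating, rcond=None)
rating_hat = X_rating @ w_rating
```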
What features do we have about games? We have basic information about every game, such as its player count and playing time, and we also have many BGG outcomes, such as the number of comments and the number of people trading it, which we will not use in predicting the outcomes we care about. There is some missingness in the playing time variables, which we will address in our recipe preparing the data.
We also have a variety of information about game mechanics, categories, artists, publishers, designers, and so on. Some of these categories are not observed for every game; for instance, a game may not have expansions or integrations with other games.
There are ~180 different mechanics, ~20k publishers, ~30k designers, and ~22k artists present in our training set. This is good in the sense that we have ample information about games for models to use in training, but bad in the sense that if we threw all of it into a model we would quickly run up against the curse of dimensionality.
| type           | n_types | n_games |
|----------------|--------:|--------:|
| publisher      |  21,761 |  97,816 |
| category       |      84 |  96,064 |
| designer       |  31,574 |  84,391 |
| mechanic       |     182 |  83,648 |
| family         |   4,473 |  66,392 |
| artist         |  22,201 |  42,386 |
| implementation |   8,792 |   8,790 |
| expansion      |  27,440 |   7,764 |
| integration    |   3,787 |   3,646 |
| compilation    |   3,079 |   2,152 |
How can we make use of this information for modeling? We could create dummy variables for every different type, but this would quickly create thousands of features, many of which would contain little information. We could view this as a p > N problem and let the data speak for itself via feature selection and dimension reduction methods.
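For illustration, one-hot encoding a multi-valued field might look like this (toy data with a few real designer names used purely as examples; applied to the full designer column this would yield ~30k such features):

```python
import pandas as pd

games = pd.DataFrame({
    "game_id": [1, 2, 3],
    "designer": [["Knizia"], ["Knizia", "Rosenberg"], ["Lacerda"]],
})

# One row per (game, designer) pair, then one 0/1 column per designer.
dummies = (
    games.explode("designer")
    .assign(flag=1)
    .pivot_table(index="game_id", columns="designer",
                 values="flag", fill_value=0)
)
print(dummies.shape)  # (3, 3) here; (N, ~30k) on the full designer list
```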
Alternatively, if every game had only one mechanic/designer/publisher, we could mean encode on the training set. For instance, instead of using thousands of dummy variables for each designer, we would have one ‘designer_mean’ feature that is simply that designer’s mean outcome value in the training set. This can dramatically reduce the dimensionality of categorical features while keeping the information we want.
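A minimal sketch of mean encoding under that simplifying one-designer-per-game assumption (toy data; column names are hypothetical):

```python
import pandas as pd

train = pd.DataFrame({
    "designer": ["A", "A", "B", "B", "C"],
    "avg_rating": [7.0, 8.0, 6.0, 6.5, 9.0],
})

# Replace the categorical with each designer's mean outcome in training.
designer_mean = train.groupby("designer")["avg_rating"].mean()
train["designer_mean"] = train["designer"].map(designer_mean)
print(designer_mean["A"])  # 7.5
```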
For our purposes, the hang-up with a simple mean encoding approach is that a game may have multiple designers, categories, mechanics, artists, and publishers. For designers we might be able to get by with taking the mean of the designer means, but it gets more complicated with mechanics - most games have multiple mechanics, and it’s the combination of mechanics that we are interested in exploring. The other complication is that some designers have only designed a handful of games while others have designed hundreds, so the mean may not impart the same amount of information.
On top of all of this, we have to be careful in what features we allow to enter a model, as some of the categories about games are themselves a reflection of the outcomes we want to predict.
With all this in mind, we’ll do a bit of inspection to figure out which features of games we’ll allow to enter our training recipe, in essence using a manual filtering method to select features.
One set of features relates to a game’s “family”, which is a sort of catch-all term for various buckets that games might fall into: Kickstarters, dungeon crawls, tableau builders, etc. Some of these are likely to be very useful in training a model, while others should be omitted. We don’t, for instance, want to include whether a game has digital implementations, as these are a reflection of a game’s popularity. These features also have a very long tail, with some families containing only one or two games. We’ll apply a near-zero variance filter, removing family features that apply to a little less than 1% of games.
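A sketch of that near-zero variance filter, assuming the families have already been spread into 0/1 indicator columns (the column names are hypothetical):

```python
import pandas as pd

# Toy game x family indicator matrix, 200 games.
families = pd.DataFrame({
    "family_kickstarter": [1, 1, 0, 0] * 50,    # 50% of games
    "family_dungeon_crawl": [1, 0, 0, 0] * 50,  # 25% of games
    "family_one_off": [1] + [0] * 199,          # 0.5% of games
})

MIN_PROP = 0.01  # drop families seen in less than ~1% of games
keep = [c for c in families.columns if families[c].mean() >= MIN_PROP]
print(keep)  # the one-off family is filtered out
```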
Some features we won’t include at all, such as Mensa Select or implementations on BoardGameArena/Tabletopia, as these are outcomes that typically occur once a game has become popular and shouldn’t be used as predictors.
We’ll do the same thing for categories, but this variable is much smaller and generally pretty well organized.
We’ll include all of these, though there will likely be some overlap between these and other features which we can take care of with a correlation filter.
Mechanics are also pretty well organized, so we don’t have to do much filtering.
We’ll just keep all of the mechanics, as these are the main features of games that we’ll focus our attention on.
How should we handle artist and designer effects? We’ll use a much lower minimum proportion here, as very few designers have designed ~100 games.
This amounts to allowing for designers once they have released about 20 games. We’ll more or less take the same approach for artists.
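A sketch of that frequency cutoff, assuming a long table with one row per (game, designer) pair (the names here are hypothetical):

```python
import pandas as pd

# Toy long table: one row per (game, designer) pair.
game_designers = pd.DataFrame({
    "game_id": range(30),
    "designer": ["Prolific"] * 25 + ["Occasional"] * 5,
})

MIN_GAMES = 20  # keep a designer feature only after ~20 released games
counts = game_designers["designer"].value_counts()
kept = counts[counts >= MIN_GAMES].index.tolist()
print(kept)  # ['Prolific']
```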